Abstract

Common statistical practice has shown that the full power of Bayesian methods is not realized until hierarchical priors are used, as these allow for greater “robustness” and the ability to “share statistical strength.” Yet it is an ongoing challenge to provide a learning-theoretically sound formalism of such notions that: offers practical guidance concerning when and how best to utilize hierarchical models; provides insights into what makes for a good hierarchical prior; and, when the form of the prior has been chosen, can guide the choice of hyperparameter settings. We present a set of analytical tools for understanding hierarchical priors in both the online and batch learning settings. We provide regret bounds under log-loss, which show how certain hierarchical models compare, in retrospect, to the best single model in the model class. We also show how to convert a Bayesian log-loss regret bound into a Bayesian risk bound for any bounded loss, a result which may be of independent interest.

Risk and regret bounds for Student’s t and hierarchical Gaussian priors allow us to formalize the concepts of “robustness” and “sharing statistical strength.” Priors for feature selection are investigated as well. Our results suggest that the learning-theoretic benefits of using hierarchical priors can often come at little cost on practical problems.

1. Introduction 

There are two standard justifications for the use of hierarchical models. The first is that they allow for the representation of greater uncertainty by placing “hyperpriors” on the hyperparameters of the prior distribution (Berger, 1985; Bernardo & Smith, 2000; Gelman et al., 2013). Explicitly modeling the additional uncertainty yields greater “robustness” to misspecification and unexpected data. The second is that hierarchical models permit the “sharing of statistical strength” between related observations or cohorts (Gelman et al., 2013). For example, take the recent “Big Bayes Stories” special issue of the journal Statistical Science, which comprised short articles describing successful applications of Bayesian models to a diverse range of problems, including political science, astronomy, and public health (Mengersen & Robert, 2014). Most of the Bayesian models were hierarchical, and the practitioners commonly cited the need for robustness and for sharing of statistical strength because of limited data as reasons for choosing a hierarchical Bayesian approach. Gelman & Hill (2006) and Gelman et al. (2013) both contain further examples of problems in which hierarchical modeling is critical to obtaining high-quality inferences.

Within the machine learning and vision literature, Salakhutdinov et al. (2011) offers an illustrative case study in the benefits and the pitfalls of employing a hierarchical model. The motivation of Salakhutdinov et al. (2011) was that, for image classification tasks, some categories of objects (e.g., “car” or “dog”) have many labeled positive and negative examples while other, visually related, categories (e.g., “bus” or “anteater”) have only a few labeled examples. Fig. 1 (right, a) shows the distribution of training examples for the 200 object categories used while Fig. 1 (right, b) shows the same distribution, but now objects are grouped with those with similar appearances. In both cases, the distributions are fat-tailed: there are a few categories with many training examples and many categories with a few training examples. It was hypothesized that by using a hierarchical Bayesian model, the classes with large amounts of labeled data could be used to construct better classifiers for the classes with small amounts of labeled data.

The model used by Salakhutdinov et al. (2011), which we will analyze in Section 4.2, consisted of a hierarchical Gaussian prior with a logistic regression likelihood. Two-level, one-level, and flat priors were all tested. The purpose of using the two-level prior was that it was able to encode information about which object classes had visually similar objects (e.g., car and truck, dog and horse). Fig. 1 (left) compares the predictive performance of this two-level hierarchical prior with the two more impoverished priors. Observe that the one-level and two-level priors both improve performance on most object classes compared to the flat prior, but not all. Furthermore, the two-level prior always leads to greater improvement than the one-level prior on object classes where a hierarchical model helps, but also almost always leads to a greater degradation in performance on object classes where the hierarchical models decrease performance. Why the different performance characteristics for the two hierarchical models? Why do some categories have improved accuracy while others have decreased accuracy? In a post-hoc analysis, Salakhutdinov et al. (2011) note that the “objects with the largest improvement...borrow visual appearance from other frequent objects” while “objects with the largest decrease [such as ‘umbrella’ and ‘merchandise’] are abstract, and their visual appearance is very different from other object categories.”

The results just described lead to numerous theoretical questions of practical consequence:

Q1 Can we formalize why for some object classes there was a beneficial sharing of statistical strength, while for other classes the sharing was detrimental?

Q2 Can we understand when a flat model should be preferred to a hierarchical one to avoid unfavorable sharing?

Q3 More generally, can we obtain guidance on the best type of prior for the problem at hand? Perhaps a different hierarchical prior would have been better suited to learning the image classifiers. For example, could placing hyperpriors on the variance parameters lead to greater “robustness” for object categories such as ‘umbrella’ and ‘merchandise,’ whose visual appearance differs from other object categories?

Q4 Once the form of the prior has been chosen, how should hyperparameters be set to maximize learning? The settings of the variance hyperparameters were left unspecified by Salakhutdinov et al. (2011), and it is not clear a priori how they should be set, or how much effect their choice will have on learning.

While we have primarily framed these questions in terms of a single model from one paper, this focus was simply for concreteness. Similar results leading to the same types of questions can be found in the numerous articles that make use of hierarchical Bayesian methods. For example, one might instead consider the hierarchical models that have been used in political science for analyzing polling and census data to predict election outcomes (Ghitza & Gelman, 2013) and in demography for predicting population growth, life expectancy, and fertility rates (Raftery et al., 2013; 2012; Alkema et al., 2011).

Figure 1. Right: a) Distribution of training examples per object class. b) Same as a), but with objects grouped by visual appearance. Left: Improvement in classification accuracy of hierarchical models compared to flat model. Object categories are sorted by improvement. Reproduced and reconstructed from Salakhutdinov et al. (2011).

In this paper we seek to answer the questions just posed in 
terms of two learning-theoretic quantities: regret (in online 
learning) and statistical risk (in batch learning). The online 
learning setting applies, for example, to the demography 
applications and election prediction while the batch setting 
is relevant to the image classification problem as well as 
election prediction (whether the online or batch analysis 
applies to election prediction depends on how the problem 
is formulated). 

In the online learning framework (Dawid & Vovk, 1999; Cesa-Bianchi & Lugosi, 2006), no assumptions are made about the data-generating mechanism. Inputs are presented to the learner one by one. After receiving each input the learner predicts the output, then suffers a loss after observing the true output. The goal of the learner is to avoid doing much worse than (i.e., having large regret relative to) a fixed class of predictors. Online learning guarantees are attractive for the analysis of hierarchical Bayesian models because such models are so often used in exactly those circumstances when orthodox Bayesian justifications do not apply: typically the modeler does not think that her model reflects the true data-generating process, but is instead employing hierarchical methods either to increase robustness against a poor choice of hyperparameters or to speed learning by allowing for the sharing of statistical strength between populations.

Regret bounds, however, do not themselves give any generalization guarantees about how the learner will perform on future data. Statistical risk bounds provide guarantees about the learner’s expected loss on unseen examples by making assumptions about how the data are generated (for example, from an i.i.d. or strongly mixing process). Although there is a stochastic assumption, risk bounds also do not assume that the data are generated according to the model used. We derive a general result for transferring Bayesian regret bounds to risk bounds for bounded losses.

Regret bounds for a number of Bayesian models have previously been developed (Vovk, 2001; Kakade & Ng, 2004; Kakade et al., 2005; Banerjee, 2006; 2007; Seeger et al., 2008), with a particular focus on regression and simple priors such as independent Gaussian distributions for each regression coefficient. For a discussion of more general (but asymptotic) Bayesian regret bounds for exponential families and other sufficiently “regular” model classes, see Grünwald (2007, Chapter 8). We follow the approach originally taken in Kakade & Ng (2004), and further explored in Kakade et al. (2005), Banerjee (2006), and Seeger et al. (2008), which applies to a large class of Bayesian generalized linear models (GLMs). We extend the technique to apply to certain non-GLM likelihoods as well, including multi-class logistic regression. All proofs are deferred to the Appendices.

We answer Questions Q1-Q4 in some important cases by deriving regret bounds for three types of hierarchical priors. First, we consider the use of an inverse gamma hyperprior for the Gaussian prior’s variance parameter and demonstrate that the hyperprior leads to greater robustness to data that is well-explained by setting the GLM parameter vector to have very large norm. Next, we analyze hierarchical Gaussian models that allow for the sharing of statistical strength. Our results, which complement existing work on transfer and multitask learning theory (Baxter, 1997; Ben-David & Schuller, 2003; Pentina & Lampert, 2014), show that when the parameters with small regret for a collection of related tasks are either (a) similar or (b) not unexpected under the prior, then the hierarchical model has a smaller regret bound than assuming the tasks are independent. Finally, we show that spike-and-slab priors can exploit sparse parameters with small regret.

2. Bayesian Online Learning 

In online learning, the learner must predict (a distribution over) $y \in \mathcal{Y} \subset \mathbb{R}$ after observing $x \in \mathcal{X} \subset \mathbb{R}^n$. In this paper, we assume the prediction is made according to a generalized linear model (GLM) $p(y \mid x, \theta) = p(y \mid \theta \cdot x)$, where $\theta \in \Theta \subset \mathbb{R}^n$ is a parameter vector to be chosen. GLMs provide significant modeling flexibility, and the class of GLM models and priors we analyze includes a range of models used in real-world scientific applications (Gelman & Hill, 2006; Gelman et al., 2013). Two widely used GLMs are the logistic regression likelihood

$$p(y = 1 \mid \theta, x) = \frac{1}{1 + \exp(-\theta \cdot x)}, \qquad (1)$$

and the Gaussian linear regression likelihood $p(y \mid \theta, x) = \mathcal{N}(y \mid \theta \cdot x, \sigma^2)$, $y \in \mathbb{R}$. Since we are taking a Bayesian approach, we place a prior density $p_0(\theta)$ on $\theta$, with corresponding distribution $P_0$.¹ At time step $t$, the learner observes $x_t$, outputs a distribution over $\mathcal{Y}$, then observes $y_t \in \mathcal{Y}$. The Bayesian (model average) learner predicts $p(y \mid x_t, Z_{t-1})$, where $Z_t \triangleq \{(x_1, y_1), \dots, (x_t, y_t)\}$, and then suffers the log-loss $-\ln p(y_t \mid x_t, Z_{t-1})$. Hence, the cumulative loss incurred is

$$L_{\mathrm{Bayes}}(Z_T) \triangleq -\sum_{t=1}^{T} \ln p(y_t \mid x_t, Z_{t-1}).$$


If $Q$ is a distribution over $\theta$, then using $Q$ for prediction leads to loss on example $t$ of $\ell_t(Q) = \mathbb{E}_Q[-\ln p(y_t \mid x_t, \theta)]$ and hence cumulative loss

$$L_Q(Z_T) \triangleq \mathbb{E}_Q\Big[-\sum_{t=1}^{T} \ln p(y_t \mid x_t, \theta)\Big].$$

If $Q = \delta_\theta$, then we write $L_\theta$ instead of $L_Q$, so $L_Q(Z_T) = \mathbb{E}_Q[L_\theta(Z_T)]$. Our objective is to derive regret bounds of the form

$$\mathcal{R}(Z_T, \theta) \triangleq L_{\mathrm{Bayes}}(Z_T) - L_\theta(Z_T) \le B(\theta) + C(T), \qquad (2)$$

where $\mathcal{R}(Z_T, \theta)$ is the regret and $B(\theta) + C(T)$ is a regret bound depending on the choice of prior $P_0$. We aim for $C(T) = o(T)$, so that for a fixed $\theta$, the average loss $T^{-1} L_{\mathrm{Bayes}}(Z_T)$ is bounded by $T^{-1} L_\theta(Z_T) + o(1)$.

Our approach to bounding $L_{\mathrm{Bayes}}(Z_T)$ follows that of previous work on Bayesian GLM regret bounds with log-loss (Kakade & Ng, 2004; Kakade et al., 2005; Seeger et al., 2008), relying on the following well-known result:

Proposition 2.1 (Kakade & Ng (2004); Banerjee (2006)). The Bayesian cumulative loss is bounded as

$$L_{\mathrm{Bayes}}(Z_T) \le L_Q(Z_T) + \mathrm{KL}(Q \,\|\, P_0). \qquad (3)$$
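The online protocol and the inequality in Proposition 2.1 can be illustrated on a toy problem. The following is a minimal sketch, not the paper's setup: it assumes a finite grid of candidate parameters with a uniform prior, labels in {0, 1}, and synthetic data; all function and variable names are hypothetical. For a discrete prior and $Q$ a point mass on a grid point, $\mathrm{KL}(Q \,\|\, P_0)$ equals the log of one over that point's prior mass, so the Bayesian learner's cumulative log-loss should not exceed the best grid point's loss plus $\ln(\text{grid size})$.

```python
import math
import random


def logistic_prob(y, theta, x):
    """p(y | theta, x) for labels y in {0, 1} under logistic regression."""
    z = sum(t * xi for t, xi in zip(theta, x))
    p1 = 1.0 / (1.0 + math.exp(-z))
    return p1 if y == 1 else 1.0 - p1


def bayes_cumulative_log_loss(data, thetas, prior):
    """Run the Bayesian model-average learner online; return its cumulative log-loss."""
    weights = list(prior)  # posterior weights, initialised at the prior
    total = 0.0
    for x, y in data:
        likes = [logistic_prob(y, th, x) for th in thetas]
        pred = sum(w * l for w, l in zip(weights, likes))  # p(y_t | x_t, Z_{t-1})
        total += -math.log(pred)
        # Bayes update: reweight each grid point by its likelihood and renormalise
        weights = [w * l / pred for w, l in zip(weights, likes)]
    return total


def fixed_cumulative_log_loss(data, theta):
    """Cumulative log-loss L_theta of a single fixed parameter."""
    return sum(-math.log(logistic_prob(y, theta, x)) for x, y in data)


random.seed(0)
# Hypothetical 9 x 9 grid of two-dimensional parameters with a uniform prior.
thetas = [(a / 2.0, b / 2.0) for a in range(-4, 5) for b in range(-4, 5)]
prior = [1.0 / len(thetas)] * len(thetas)

data = []
for _ in range(200):
    angle = random.uniform(0.0, 2.0 * math.pi)
    x = (math.cos(angle), math.sin(angle))  # unit-norm inputs
    y = 1 if random.random() < logistic_prob(1, (1.5, -1.0), x) else 0
    data.append((x, y))

bayes_loss = bayes_cumulative_log_loss(data, thetas, prior)
best_loss = min(fixed_cumulative_log_loss(data, th) for th in thetas)
kl_point_mass = math.log(len(thetas))  # KL(point mass || uniform prior)
print(bayes_loss, best_loss + kl_point_mass)
```

The learner never needs to know which grid point is best; the price of not knowing is at most the KL term.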


For the GLM $p(y \mid \theta \cdot x)$, define $f_y(z) \triangleq -\ln p(y \mid \theta \cdot x = z)$. We make two assumptions throughout the remainder of the paper (they will usually not be stated explicitly):

$$|f_y''(z)| \le c \ \text{ for all } y, z; \qquad \text{(A1)}$$

$$\|x_t\|_2 \le 1 \ \text{ for all } t. \qquad \text{(A2)}$$
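For intuition about the curvature condition, the second derivative of the negative log-likelihood can be evaluated directly for the two likelihoods introduced earlier. The sketch below assumes the logistic likelihood with $f(z) = \ln(1 + e^{-z})$ and a Gaussian likelihood with variance `sigma2`; the function names are hypothetical. On the grid used here the logistic curvature peaks at 1/4 (at $z = 0$), so any constant at least that large satisfies the condition, while the Gaussian curvature is the constant $1/\sigma^2$.

```python
import math


def logistic_nll_second_derivative(z):
    """f''(z) for f(z) = ln(1 + exp(-z)); equals s(z) * (1 - s(z)) with s the sigmoid."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def gaussian_nll_second_derivative(sigma2):
    """f''(z) for f(z) = (y - z)^2 / (2 * sigma2) + const; independent of y and z."""
    return 1.0 / sigma2


# Evaluate the logistic curvature on a grid covering z in [-20, 20].
grid = [i / 100.0 for i in range(-2000, 2001)]
max_logistic = max(logistic_nll_second_derivative(z) for z in grid)
gaussian_curvature = gaussian_nll_second_derivative(0.5)
print(max_logistic, gaussian_curvature)
```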

The first assumption can be understood as requiring the likelihood to be sufficiently smooth. The second assumption sets the scale of the problem, which is necessary since scaling $x_t$ up requires scaling $\theta$ down, and vice versa: $p(y \mid C^{-1}\theta \cdot Cx) = p(y \mid \theta \cdot x)$ for any $C \neq 0$. Note that for the Gaussian linear regression model with variance $\sigma^2$ and the logistic regression model, (A1) holds with $c = 1/\sigma^2$ and $c = 1/2$, respectively. Proposition 2.1 leads to the following theorem for obtaining regret bounds for the Bayesian model average learner.

¹Throughout, we use lowercase letters for densities and uppercase letters to denote the corresponding measures.

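Theorem 2.2 (Bayesian regret meta-theorem). Let $Q_{\theta^*,\phi}$ be a distribution with parameter $\boldsymbol{\phi} \in \Phi \subset \mathbb{R}^d$ (written $\phi$ if $d = 1$) and mean $\theta^*$. If (A1) and (A2) hold, then for all $\phi$,

$$\mathcal{R}(Z, \theta^*) \le \frac{Tc}{2}\,\big\|\mathrm{Var}_{Q_{\theta^*,\phi}}[\theta]\big\| + \mathrm{KL}(Q_{\theta^*,\phi} \,\|\, P_0),$$

where $\|\cdot\|$ is the spectral norm. In particular, if the components of $\theta$ are uncorrelated under $Q_{\theta^*,\phi}$, then $\|\mathrm{Var}_{Q_{\theta^*,\phi}}[\theta]\| = \sup_i \mathrm{Var}_{Q_{\theta^*,\phi}}[\theta_i]$.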


Theorem 2.2 is our first main result and will be repeatedly applied in Section 4 by choosing an appropriate $Q_{\theta^*,\phi}$ and then optimizing $\phi$. Although the bound appears to be linear in $T$, typically $\phi$ can be chosen such that $\|\mathrm{Var}_{Q_{\theta^*,\phi}}[\theta]\| = O(T^{-1})$ and $\mathrm{KL}(Q_{\theta^*,\phi} \,\|\, P_0) = O(\ln T)$, leading to a logarithmic regret bound. Theorem 2.2 provides an attractive approach to deriving regret bounds because there is no need to work directly with the posterior, which is often analytically intractable. For example, there is no closed-form expression for the posterior of the Bayesian logistic regression model. The theorem generalizes the approach originally taken in Kakade & Ng (2004), in which a Gaussian prior for $\theta$ was considered:

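Theorem 2.3 (Gaussian regret (Kakade & Ng, 2004)). If $\theta \sim \mathcal{N}(0, \sigma^2 I)$, then $\mathcal{R}(Z, \theta^*)$ is bounded by

$$R^{\mathrm{G}}_{\mathrm{Bayes}}(Z, \theta^*) \triangleq \frac{1}{2\sigma^2}\|\theta^*\|^2 + \frac{n}{2}\ln\Big(1 + \frac{Tc\sigma^2}{n}\Big).$$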
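The Gaussian bound of Kakade & Ng (2004), $\|\theta^*\|^2/(2\sigma^2) + (n/2)\ln(1 + Tc\sigma^2/n)$, can be recovered numerically from the meta-theorem. The sketch below assumes the meta-theorem bound has the form $(Tc/2)\,\|\mathrm{Var}_Q[\theta]\| + \mathrm{KL}(Q \,\|\, P_0)$ and takes $Q = \mathcal{N}(\theta^*, \phi^2 I)$, for which the KL divergence to $\mathcal{N}(0, \sigma^2 I)$ has a standard closed form. The function names and the numerical values are illustrative assumptions.

```python
import math


def meta_bound_gaussian_q(phi2, theta_norm2, sigma2, n, T, c):
    """(T c / 2) * ||Var_Q|| + KL(Q || P_0) for Q = N(theta*, phi2 I), P_0 = N(0, sigma2 I)."""
    variance_term = 0.5 * T * c * phi2
    kl = 0.5 * (n * phi2 / sigma2 + theta_norm2 / sigma2 - n
                + n * math.log(sigma2 / phi2))
    return variance_term + kl


def gaussian_regret_bound(theta_norm2, sigma2, n, T, c):
    """Closed form: ||theta*||^2 / (2 sigma2) + (n / 2) * ln(1 + T c sigma2 / n)."""
    return theta_norm2 / (2.0 * sigma2) + 0.5 * n * math.log(1.0 + T * c * sigma2 / n)


n, T, c, sigma2, theta_norm2 = 10, 1000, 0.25, 1.0, 4.0
# Setting the derivative of the bound with respect to phi2 to zero gives this value.
phi2_star = sigma2 * n / (n + T * c * sigma2)
optimised = meta_bound_gaussian_q(phi2_star, theta_norm2, sigma2, n, T, c)
closed_form = gaussian_regret_bound(theta_norm2, sigma2, n, T, c)
print(optimised, closed_form)
```

The optimal variance shrinks like $1/T$, which is what turns the apparently linear first term into a constant and leaves only the logarithmic KL contribution.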


2.1. Beyond GLMs 

Theorem 2.2 follows from a more general result, Theorem 2.4, which allows for non-GLM likelihoods. Specifically, instead of the likelihood being a GLM, we assume the likelihood can be written in the form $p(y \mid x, \xi, \psi) = p(y \mid \xi x, \psi)$, where $\xi \in \mathbb{R}^{n' \times n}$ is a matrix and $\psi \in \mathbb{R}^{n''}$. The full parameter vector is $\theta = (\xi, \psi) \in \mathbb{R}^N$, $N = nn' + n''$ (implicitly flattening the matrix $\xi$). Let $f_y(z) \triangleq -\ln p(y \mid (\xi x, \psi) = z)$. We require the following assumption in place of (A1):

$$\|f_y''(z)\| \le c \ \text{ for all } y, z, \qquad \text{(A1')}$$

where $f_y''(z)$ denotes the matrix of second partial derivatives (Hessian).

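Theorem 2.4 (Generalized Bayesian regret meta-theorem). Let $Q_{\theta^*,\phi}$ be a distribution with parameter $\phi \in \Phi \subset \mathbb{R}^d$ and mean $\theta^*$. If (A1') and (A2) hold, then for all $\phi$,

$$\mathcal{R}(Z, \theta^*) \le \frac{Tc}{2}\,\big\|\mathrm{Var}_{Q_{\theta^*,\phi}}[\theta]\big\| + \mathrm{KL}(Q_{\theta^*,\phi} \,\|\, P_0).$$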


Of particular interest is that Theorem 2.4 can handle multi-class logistic regression (MLR). In multi-class regression, each example $x_t$ has one of $K$ labels $y_t \in \{1, \dots, K\}$, indicating which class the example belongs to. For MLR, each class $k$ has an associated parameter $\theta^{(k)}$. The parameters are combined into a single likelihood:

$$p(y_t \mid \theta, x_t) = \frac{\exp(\theta^{(y_t)} \cdot x_t)}{\sum_{k=1}^{K} \exp(\theta^{(k)} \cdot x_t)}. \qquad (4)$$
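As a concrete illustration of the MLR likelihood, the following minimal sketch computes the class probabilities for one input. It assumes the softmax form with one parameter vector per class; the function name and example values are hypothetical, and classes are indexed from 0 in the code rather than from 1.

```python
import math


def mlr_likelihood(thetas, x):
    """Return [p(y = k | theta, x) for each class k] under multi-class logistic regression."""
    scores = [sum(t * xi for t, xi in zip(theta_k, x)) for theta_k in thetas]
    shift = max(scores)  # subtracting the max avoids overflow; the ratio is unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# K = 3 classes, n = 2 features (illustrative values only).
thetas = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
probs = mlr_likelihood(thetas, (0.6, 0.8))
print(probs)
```

Because only differences between class scores matter, the full parameter vector stacks all $K$ class vectors, which is why this likelihood depends on a matrix-vector product rather than a single inner product.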


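Theorem 2.5 (MLR Gaussian regret). If $\theta^{(k)} \sim \mathcal{N}(0, \sigma^2 I)$, $k = 1, \dots, K$, then using the MLR likelihood guarantees that $\mathcal{R}(Z, \theta^*)$ is bounded by

$$R^{\mathrm{mlr\text{-}G}}_{\mathrm{Bayes}}(Z, \theta^*) \triangleq [\ldots]$$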


3. Risk Bounds 

While online regret bounds are attractive because they make no assumptions about the data-generating process, it is also desirable to have risk bounds in the batch setting since risk bounds provide generalization guarantees for unseen data. We now develop a connection between regret and risk bounds via a PAC-Bayesian analysis (McAllester, 2003; Audibert & Bousquet, 2007; Catoni, 2007). Such bounds also have the benefit of applying to any bounded loss (e.g., the 0-1 loss for binary classification), which may be more task-relevant than the log-loss. In the batch setting, the data $Z_T$ are received all at once by the learner and are assumed to be distributed i.i.d. according to some distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$: $(x_t, y_t) \sim \mathcal{D}$, $t = 1, \dots, T$. Let $\ell$ be a bounded loss function taking a probability distribution over $\mathcal{Y}$ and an element of $\mathcal{Y}$ as arguments. Without loss of generality assume $\ell \in [0, 1]$. Writing $\ell_\theta(x, y) \triangleq \ell(p(\cdot \mid x, \theta), y)$, for any distribution $Q$ over $\theta$, let

$$\mathcal{L}(Q) \triangleq \mathbb{E}_{(x,y) \sim \mathcal{D}}\,\mathbb{E}_{\theta \sim Q}[\ell_\theta(x, y)],$$

$$\hat{\mathcal{L}}(Q, Z_T) \triangleq T^{-1} \sum_{t=1}^{T} \mathbb{E}_{\theta \sim Q}[\ell_\theta(x_t, y_t)]$$

be, respectively, the expected and empirical losses under $Q$. PAC-Bayesian analyses consider the risk of the Gibbs predictor for the distribution $Q$ (i.e., sample $\theta \sim Q$, predict with $p(\cdot \mid x, \theta)$), not the model average over $Q$ (i.e., predict with $\int p(\cdot \mid x, \theta)\, Q(d\theta)$). A typical bound (specialized to the Bayesian setting) is the following (here $p_T(\theta) \triangleq p(\theta \mid Z_T)$, with corresponding distribution $P_T$):

Theorem 3.1 (Audibert & Bousquet (2007)). Fix $\kappa > 1/2$ and write $\kappa' = (2\kappa - 1)$. For any distribution $\mathcal{D}$, with probability at least $1 - \delta$ over samples $(x_t, y_t) \sim \mathcal{D}$,

$$|\mathcal{L}(P_T) - \hat{\mathcal{L}}(P_T, Z_T)| \le T^{-1/2}\,[\ldots]\,\sqrt{\mathrm{KL}(P_T \,\|\, P_0) + \ln(\kappa'/\delta)}.$$

Combining the PAC-Bayesian risk bound with Bayesian regret bounds leads to our second main result:

Theorem 3.2. Assume that (2) holds and fix $\kappa > 1/2$. For any distribution $\mathcal{D}$, with probability at least $1 - \delta$ over samples $(x_t, y_t) \sim \mathcal{D}$,

$$|\mathcal{L}(P_T) - \hat{\mathcal{L}}(P_T, Z_T)| \le T^{-1/2}\,[\ldots]\,\sqrt{B(\hat\theta) + C(T) + \ln(\kappa'/\delta)}, \qquad (6)$$

where $\hat\theta \triangleq \arg\min_\theta L_\theta(Z_T)$.

An attractive feature of Theorem 3.2 is that the bound does not rely on understanding the posterior $P_T$, as is required by a direct application of a PAC-Bayesian bound such as that given in Theorem 3.1, which requires calculating $\mathrm{KL}(P_T \,\|\, P_0)$. Yet the PAC-Bayesian regret bound remains data-dependent due to its dependence on the empirical risk minimizer (ERM) $p(y \mid x, \hat\theta)$.

Examining the proof of Theorem 3.2, it is easily seen that in fact any $\theta$ such that $L_\theta(Z_T) \le L_{P_T}(Z_T)$ can be chosen in place of $\hat\theta$. Such alternative choices may lead to significantly tighter bounds and are particularly important, for example, in the application of the theorem to the spike-and-slab prior (cf. Section 4.3), as the ERM parameter will in almost all circumstances satisfy $\|\hat\theta\|_0 = n$, which would lead to a poor generalization bound when $n$ is large.

In words, Theorem 3.2 can be understood as stating that if the Bayesian (model average) learner has small log-loss regret compared to the ERM, then with high probability the Bayesian Gibbs predictor will generalize well if the loss function is bounded. Or, as a slogan, the theorem shows that “small regret in the online learning setting implies good generalization bounds in the batch setting.” The theorem thus connects PAC-Bayesian bounds, Bayesian regret bounds, and empirical risk minimization.
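The distinction between the Gibbs predictor and the model-average predictor matters for thresholded losses. The following minimal sketch contrasts their empirical 0-1 losses for a binary logistic model. It assumes a finite set of parameter draws standing in for samples from $Q$, labels in {0, 1}, and a 0-1 loss applied to the most probable label; every name and value here is a hypothetical illustration.

```python
import math


def predict_prob(theta, x):
    """p(y = 1 | x, theta) for a logistic model."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def zero_one_loss(p1, y):
    """0-1 loss of the most probable label under the predicted distribution."""
    predicted = 1 if p1 >= 0.5 else 0
    return 0.0 if predicted == y else 1.0


def gibbs_empirical_loss(samples, data):
    """Average loss when each prediction uses a single theta drawn from Q."""
    losses = [zero_one_loss(predict_prob(th, x), y) for th in samples for x, y in data]
    return sum(losses) / len(losses)


def model_average_empirical_loss(samples, data):
    """Loss of the predictor that first averages p(. | x, theta) over Q."""
    total = 0.0
    for x, y in data:
        p1 = sum(predict_prob(th, x) for th in samples) / len(samples)
        total += zero_one_loss(p1, y)
    return total / len(data)


# Three equally weighted draws standing in for Q; one of them points the wrong way.
samples = [(2.0, 0.0), (-1.0, 0.0), (2.0, 0.0)]
data = [((1.0, 0.0), 1), ((-1.0, 0.0), 0)]
gibbs = gibbs_empirical_loss(samples, data)
averaged = model_average_empirical_loss(samples, data)
print(gibbs, averaged)
```

In this toy case the misdirected draw costs the Gibbs predictor errors on a third of its predictions, while averaging the predictive probabilities first outvotes it. The PAC-Bayesian statements in this section concern the Gibbs quantity.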

4. Applications 

We now use Theorem 2.2 to analyze hierarchical priors for robustness, sharing of statistical strength, and feature selection.

4.1. Hierarchical Priors for Robustness 

In this section we answer questions Q2-Q4 as they relate to hierarchical priors for robust inference, demonstrating how, with a proper choice of hyperparameters, a hierarchical prior can lead to increased robustness compared to a flat prior. Specifically, we analyze a canonical use of a hierarchical prior: capturing greater uncertainty in the value of a parameter by placing a hyperprior on the variance of the Gaussian prior on that parameter (Berger, 1985; Bishop, 2006; Gelman et al., 2013):

$$\sigma_0^2 \mid \alpha, \beta \sim \Gamma^{-1}(\alpha, \beta) \quad \text{and} \quad \theta_i \mid \mu_0, \sigma_0^2 \sim \mathcal{N}(\mu_0, \sigma_0^2),$$

where $\Gamma^{-1}(\alpha, \beta)$ is the inverse gamma distribution with shape $\alpha$ and scale $\beta$. Let $\nu = 2\alpha$ and $\sigma^2 = \beta/\alpha$. Then the marginal distribution of $\theta$ follows the multivariate t-distribution with location $\mu_0 \mathbf{1}$, scale matrix $\sigma^2 I$, and $\nu$ degrees of freedom:

$$\theta \mid \mu_0, \sigma^2, \nu \sim \mathcal{T}_\nu(\mu_0 \mathbf{1}, \sigma^2 I),$$

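where $\mathbf{1}$ is the all-ones vector. The multivariate t-distribution density is

$$p_{\mathcal{T}}(\theta \mid \mu, \Sigma, \nu) = \frac{\Gamma\big(\frac{\nu + n}{2}\big)}{\Gamma\big(\frac{\nu}{2}\big)\,\pi^{n/2}\,\nu^{n/2}\,|\Sigma|^{1/2}} \Big[1 + \frac{1}{\nu}(\theta - \mu)^\top \Sigma^{-1} (\theta - \mu)\Big]^{-\frac{\nu + n}{2}}.$$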
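To make the tail behavior of this density concrete, the following minimal sketch evaluates the log-density of a multivariate t-distribution with scale matrix $\sigma^2 I$ and compares it with an isotropic Gaussian. It uses the standard textbook form of the multivariate t density; the function names are hypothetical. Three checks are natural: the one-dimensional case with one degree of freedom is the Cauchy density, a far-out point has much higher density under the t-distribution, and very large degrees of freedom approach the Gaussian.

```python
import math


def mvt_log_density(theta, mu, sigma2, nu):
    """Log-density of the multivariate t with location mu, scale matrix sigma2 * I, nu d.o.f."""
    n = len(theta)
    quad = sum((t - m) ** 2 for t, m in zip(theta, mu)) / sigma2
    log_norm = (math.lgamma((nu + n) / 2.0) - math.lgamma(nu / 2.0)
                - 0.5 * n * math.log(nu * math.pi) - 0.5 * n * math.log(sigma2))
    return log_norm - 0.5 * (nu + n) * math.log1p(quad / nu)


def gaussian_log_density(theta, mu, sigma2):
    """Log-density of N(mu, sigma2 * I)."""
    n = len(theta)
    quad = sum((t - m) ** 2 for t, m in zip(theta, mu)) / sigma2
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - 0.5 * quad


cauchy_at_zero = mvt_log_density([0.0], [0.0], 1.0, 1.0)        # should equal -ln(pi)
tail_t = mvt_log_density([10.0], [0.0], 1.0, 1.0)
tail_gauss = gaussian_log_density([10.0], [0.0], 1.0)
near_gauss = mvt_log_density([1.0], [0.0], 1.0, 1e6)
exact_gauss = gaussian_log_density([1.0], [0.0], 1.0)
print(cauchy_at_zero, tail_t, tail_gauss, near_gauss, exact_gauss)
```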

When $\nu$ is finite, the multivariate t-distribution is heavy-tailed: the probability of $\|\theta\|$ decreases at a polynomial rate as $\|\theta\| \to \infty$, compared to the exponential rate for a multivariate Gaussian. For example, $\nu = 1$ and $\Sigma = \sigma^2 I$ gives the multivariate Cauchy distribution. A multivariate Gaussian with covariance matrix $\Sigma$ is recovered by taking $\nu \to \infty$. Placing a multivariate t-distribution prior on $\theta$ yields the following regret bound:

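Theorem 4.1 (Multivariate t-distribution regret). If $\theta \sim \mathcal{T}_\nu(0, \sigma^2 I)$, then $\mathcal{R}(Z, \theta^*)$ is bounded by

$$R^{\mathrm{mvt}}_{\mathrm{Bayes}}(Z, \theta^*) \triangleq \frac{\nu + n}{2}\ln\Big(1 + \frac{\|\theta^*\|^2}{\nu\sigma^2}\Big) + \frac{n}{2}\ln\Big(\frac{\nu + n}{\nu} + \frac{Tc(\nu + n)\sigma^2}{\nu n}\Big). \qquad (7)$$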

Theorem 2.3 can be obtained as a special case of Theorem 4.1 by taking $\nu \to \infty$.


Assume $\nu > 1$. If $\|\theta^*\|^2/(\nu\sigma^2)$ is small, then

$$F(\|\theta^*\|) \triangleq \frac{\nu + n}{2}\ln\Big(1 + \frac{\|\theta^*\|^2}{\nu\sigma^2}\Big) \approx \frac{n + \nu}{\nu}\,\frac{\|\theta^*\|^2}{2\sigma^2},$$

so for “small” values of $\|\theta^*\|^2$ (relative to $\nu\sigma^2$) the regret bound behaves similarly to having a Gaussian prior on $\theta$. However, if $\|\theta^*\|^2/(\nu\sigma^2) \gg 1$, then the regret bound grows only logarithmically with $\|\theta^*\|$, as we would expect given that the multivariate t-distribution has heavy tails. Roughly speaking, $F(x)$ can be thought of as switching from quadratic to logarithmic behavior when $x = \sqrt{\nu}\,\sigma$, since this is the value at which $F$ switches from being convex to concave.
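The convex-to-concave switch can be checked by differentiating twice. As a worked step, assume the first term of the t-prior bound is $F(x) = \frac{\nu + n}{2}\ln\big(1 + x^2/(\nu\sigma^2)\big)$ with $x = \|\theta^*\|$ (the form taken for this sketch):

```latex
F'(x)  = \frac{(\nu + n)\,x}{\nu\sigma^2 + x^2},
\qquad
F''(x) = \frac{(\nu + n)\,(\nu\sigma^2 - x^2)}{(\nu\sigma^2 + x^2)^2}.
```

Under that assumption $F''(x) > 0$ for $x < \sqrt{\nu}\,\sigma$ and $F''(x) < 0$ for $x > \sqrt{\nu}\,\sigma$, so the inflection point sits at $x = \sqrt{\nu}\,\sigma$. For $x^2 \ll \nu\sigma^2$, the expansion $\ln(1 + u) \approx u$ gives $F(x) \approx \frac{\nu + n}{\nu}\,\frac{x^2}{2\sigma^2}$, the quadratic regime.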
















Risk and Regret of Hierarchical Bayesian Learners 


In general, the regret bound is large when the choice of $\theta^*$ with small loss has large magnitude. If a Gaussian prior is used, the possibility of $\|\theta^*\|$ being large can be ameliorated by choosing $\sigma^2$ large, since there is only a logarithmic regret penalty in $\sigma$. However, without a priori knowledge of how large the optimal $\theta^*$ might be, choosing a multivariate t-distribution prior with a small value for $\nu$ and a moderate value for $\sigma^2$ allows for guaranteed logarithmic regret in the magnitude of $\theta^*$ no matter how large $\|\theta^*\|$ is. Hence, the use of the hierarchical (multivariate t-distribution) prior does in fact yield greater robustness than the non-hierarchical (Gaussian) prior.²


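We can, in fact, develop more specific guidance on the choice of hyperparameters for the t-distribution. Our goal is to choose $\nu$ such that we obtain a t-distribution regret bound that is essentially as good as the Gaussian prior regret bound $R^{\mathrm{G}}_{\mathrm{Bayes}} = \frac{\|\theta^*\|^2}{2\sigma^2} + \frac{n}{2}\ln\big(1 + \frac{Tc\sigma^2}{n}\big)$ for small $\|\theta^*\|$ and better when $\|\theta^*\|$ is large. If we choose $\nu$ equal to a constant, then for $n$ much larger than $\nu$, we have

$$R^{\mathrm{mvt}}_{\mathrm{Bayes}} \approx \frac{n}{2}\ln\Big(1 + \frac{\|\theta^*\|^2}{\nu\sigma^2}\Big) + \frac{n}{2}\ln\Big(\frac{n}{\nu} + \frac{Tc\sigma^2}{\nu}\Big).$$

In the case of $\|\theta^*\|$ small, we therefore have that the first term of $R^{\mathrm{mvt}}_{\mathrm{Bayes}}$ is approximately $\frac{n}{\nu}\,\frac{\|\theta^*\|^2}{2\sigma^2}$, and thus larger than the first term of $R^{\mathrm{G}}_{\mathrm{Bayes}}$ by a factor of $n/\nu$. Furthermore, for small $T$ and any $\theta^*$, the second term of $R^{\mathrm{mvt}}_{\mathrm{Bayes}}$ is approximately $\frac{n}{2}\ln(n/\nu)$ whereas the second term of $R^{\mathrm{G}}_{\mathrm{Bayes}}$ is approximately $Tc\sigma^2/2 \ll n$. Thus, $R^{\mathrm{mvt}}_{\mathrm{Bayes}}$ with constant $\nu$ is not competitive with $R^{\mathrm{G}}_{\mathrm{Bayes}}$ in the large $n$ and small $T$ regimes. Instead consider the choice $\nu = Cn$ for constant $C > 0$, so

$$R^{\mathrm{mvt}}_{\mathrm{Bayes}} = \frac{(C + 1)n}{2}\ln\Big(1 + \frac{\|\theta^*\|^2}{Cn\sigma^2}\Big) + \frac{n}{2}\ln\Big(\frac{C + 1}{C} + \frac{C + 1}{C}\,\frac{Tc\sigma^2}{n}\Big)$$

$$\le \frac{C + 1}{C}\,\frac{\|\theta^*\|^2}{2\sigma^2} + \frac{n}{2}\ln\Big(\frac{C + 1}{C} + \frac{C + 1}{C}\,\frac{Tc\sigma^2}{n}\Big).$$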


In this case, by choosing a moderate value of $C$, we see that a multivariate t-distribution prior with $\nu = Cn$ has a regret bound competitive with that of a Gaussian prior in the small $\|\theta^*\|$ regime, and an exponentially smaller regret bound as $\|\theta^*\|$ becomes large. Furthermore, the t-distribution prior remains competitive with the Gaussian prior when $T$ is small.
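The trade-off described here can be explored numerically. The sketch below is illustrative only. It uses the Gaussian bound of Kakade & Ng (2004) and, for the t-prior, ASSUMES a bound of the form $\frac{\nu + n}{2}\ln\big(1 + \frac{\|\theta^*\|^2}{\nu\sigma^2}\big) + \frac{n}{2}\ln\big(\frac{\nu + n}{\nu}(1 + \frac{Tc\sigma^2}{n})\big)$, which is consistent with the limiting cases discussed in the text but is an assumption of this sketch. All names and numerical values are hypothetical.

```python
import math


def gaussian_bound(theta_norm2, sigma2, n, T, c):
    """Gaussian-prior regret bound: ||theta*||^2 / (2 sigma2) + (n/2) ln(1 + T c sigma2 / n)."""
    return theta_norm2 / (2.0 * sigma2) + 0.5 * n * math.log(1.0 + T * c * sigma2 / n)


def mvt_bound(theta_norm2, sigma2, nu, n, T, c):
    """ASSUMED form of the t-prior bound; tends to gaussian_bound as nu grows."""
    first = 0.5 * (nu + n) * math.log1p(theta_norm2 / (nu * sigma2))
    second = 0.5 * n * math.log((nu + n) / nu * (1.0 + T * c * sigma2 / n))
    return first + second


n, T, c, sigma2 = 100, 50, 0.25, 1.0
C = 10.0
nu = C * n  # the choice nu = C n with a moderate constant C

for theta_norm2 in (1.0, 1e2, 1e4, 1e6):
    g = gaussian_bound(theta_norm2, sigma2, n, T, c)
    t = mvt_bound(theta_norm2, sigma2, nu, n, T, c)
    print(theta_norm2, g, t)
```

Under these assumptions the two bounds are of the same order for small $\|\theta^*\|^2$, while for large $\|\theta^*\|^2$ the Gaussian bound grows linearly in $\|\theta^*\|^2$ and the t-prior bound only logarithmically.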

²A more rigorous version of this statement can be obtained for the Gaussian regression likelihood by using the fact that there is a matching lower bound on the regret for the Gaussian prior/Gaussian regression model (Kakade & Ng, 2004).


4.2. Hierarchical Priors for Sharing Statistical Strength

4.2.1. Background 

We next consider hierarchical priors that allow for the sharing of statistical strength, providing answers to Q1 and Q2: we specify some conditions under which sharing of statistical strength can be achieved and others in which a non-hierarchical prior is preferable.³ In the machine learning literature, the goal of “sharing statistical strength” has been formalized via multitask learning (MTL) and “learning-to-learn” (LTL) frameworks. A number of theoretical investigations of MTL and LTL have been carried out, beginning with a series of papers by Baxter (cf. Baxter, 1997; 2000). Generically, such MTL and LTL frameworks involve two or more learning problems that are related to each other in some manner. The learning properties are investigated as the number of tasks and/or the number of examples from each task is increased. Baxter (2000) and Ben-David & Schuller (2003) give sample complexity bounds based on classical ideas from statistical and PAC learning theory. Baxter (1997) examines the asymptotic learning properties of hierarchical Bayesian models. Pentina & Lampert (2014) take a PAC-Bayesian approach while Hassan Mahmud & Ray (2007), Hassan Mahmud (2009), and Juba (2006) develop notions of task-relatedness from an (algorithmic) information-theoretic perspective.

Typically, tasks are equated with probability distributions over examples (e.g., $(x, y)$ pairs). It is assumed that the tasks are drawn i.i.d. from an unknown task distribution. The goal is to learn the individual tasks and learn about the task distribution. Alternatively, a notion of similarity can be used to relate the tasks; the more similar the tasks, the greater the advantage of learning them using multitask algorithms. In the online learning framework no assumptions are made about the distribution of examples, so we consider two MTL scenarios in line with the latter setting. In the first scenario, one example from one task is received at each time step. In the second, which is described in Appendix E.1, at each time step an example for each task is received simultaneously.

4.2.2. Sequential Observations from Multiple Sources

The sequential observation setting is relevant to the image classification example given in the introduction, in which there are many observations from some data sources and only a small number of observations from numerous other data sources. To model this situation, at time step $t$, an input $x_t$ from source $z_t$ is observed, where $z_t \in \{1, \dots, K\}$. The learner predicts $y_t$ according to the posterior of $\theta^{(z_t)}$ given $Z_{t-1}$. An equivalent formulation is that the Bayesian learner observes $x_t = (0, \dots, x^{(k)}, 0, \dots)$ (if $z_t = k$) at each time, then receives $y_t$.

³For simplicity results are for Gaussian priors, though the extension to multivariate t-distribution priors is straightforward.

Instead of using independent Gaussian priors on $\theta^{(1)}, \dots, \theta^{(K)}$, place a prior over the means of the $K$ priors. For each dimension $j = 1, \dots, n$, let $\mu_j \mid \sigma_0 \sim \mathcal{N}(0, \sigma_0^2)$ and $\theta_j^{(k)} \mid \mu_j, \sigma \sim \mathcal{N}(\mu_j, \sigma^2)$, $k = 1, \dots, K$, and write $\theta_j^{(1:K)} \triangleq (\theta_j^{(1)}, \dots, \theta_j^{(K)})$. Integrating out $\mu_j$ yields

$$\theta_j^{(1:K)} \sim \mathcal{N}(0, \Sigma), \qquad (8)$$

where, with $\mathbf{1}_K$ denoting the $K \times K$ all-ones matrix,

$$\Sigma \triangleq s^2 \rho\, \mathbf{1}_K + s^2 (1 - \rho)\, I, \qquad (9)$$

$$s^2 \triangleq \sigma_0^2 + \sigma^2, \qquad \rho \triangleq \sigma_0^2 / (\sigma_0^2 + \sigma^2). \qquad (10)$$
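The covariance obtained by integrating out the shared mean can be checked against a direct computation. The sketch below assumes the parameterization $s^2 = \sigma_0^2 + \sigma^2$ and $\rho = \sigma_0^2/(\sigma_0^2 + \sigma^2)$ as read from the text, and compares $s^2(\rho\,\mathbf{1}_K + (1 - \rho) I)$ with the covariance implied by writing each $\theta_j^{(k)}$ as a shared mean plus independent noise. The function names are hypothetical.

```python
def hierarchical_covariance(K, sigma0_sq, sigma_sq):
    """Sigma = s^2 * (rho * 1_K + (1 - rho) * I) for one coordinate across K sources."""
    s2 = sigma0_sq + sigma_sq
    rho = sigma0_sq / s2
    return [[s2 * (rho + (1.0 - rho) * (1.0 if i == j else 0.0)) for j in range(K)]
            for i in range(K)]


def direct_covariance(K, sigma0_sq, sigma_sq):
    """Cov(theta^(k), theta^(l)) when theta^(k) = mu + eps_k, mu ~ N(0, sigma0_sq), eps_k ~ N(0, sigma_sq)."""
    return [[sigma0_sq + (sigma_sq if i == j else 0.0) for j in range(K)]
            for i in range(K)]


K, sigma0_sq, sigma_sq = 3, 2.0, 0.5
cov_param = hierarchical_covariance(K, sigma0_sq, sigma_sq)
cov_direct = direct_covariance(K, sigma0_sq, sigma_sq)
max_gap = max(abs(cov_param[i][j] - cov_direct[i][j]) for i in range(K) for j in range(K))
print(cov_param, max_gap)
```

The off-diagonal entries equal $\sigma_0^2$, so $\rho$ is the prior correlation between the same coordinate in two different sources: a larger $\sigma_0^2$ relative to $\sigma^2$ means more sharing.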

This prior corresponds to the one-level prior in Salakhutdinov et al. (2011). Similar results, which will be discussed qualitatively below, can be obtained for the two-level prior at the cost of a significantly more complicated bound. Define $Z(k) \triangleq \{(x_t, y_t) \in Z \mid z_t = k\}$, $T^{(k)} \triangleq |Z(k)|$, and $\gamma^2 \triangleq K\sigma_0^2 + \sigma^2$.

Theorem 4.2 (Hierarchical Gaussian regret, sequential observations). If $\theta_j^{(1:K)} \sim \mathcal{N}(0, \Sigma)$, $j = 1, \dots, n$, then $\mathcal{R}(Z, \theta^*)$ is bounded by

$$R^{\mathrm{HG\text{-}seq}}_{\mathrm{Bayes}}(Z, \theta^*) \triangleq [\ldots] \qquad (11)$$

It is instructive to compare the upper bound given in (11) to $\sum_k R^{\mathrm{G}}_{\mathrm{Bayes}}(Z(k), \theta^{*(k)})$ with prior variance $s^2 = \sigma_0^2 + \sigma^2$. Setting $\sigma_0 = \sigma$ yields a condition for the hierarchical model to have a smaller regret bound than the non-hierarchical model:

$$4\|\theta^{*(1)} - \theta^{*(2)}\|^2 + 3 s^2 [\ldots] < \|\theta^{*(1)}\|^2 + \|\theta^{*(2)}\|^2 + 0.863\, s^2 n. \qquad (12)$$

Of particular interest is the “one-shot learning” scenario, in which only one observation (or a small number of observations) is made from one source while many observations are made from some other sources. This setting is exactly that of the image classification problem of Salakhutdinov et al. (2011). For concreteness, consider a “large data” task with $T^{(1)}$ $[\ldots]$ and a “small data” task with $T^{(2)}$ $[\ldots]$, so that $[\ldots]$ becomes (approximately)

$$[\ldots] < [\ldots] + \|\theta^{*(2)}\|^2 + 0.863\, s^2 n.$$

But even for $n = 1$, $3\ln([\ldots]) < 0.863$, so the hierarchical model has a smaller regret bound as long as

$$4\|\theta^{*(1)} - \theta^{*(2)}\|^2 < [\ldots]$$

$[\ldots]$ constant $0 < C < 0.863$.

Hierarchical models for one-shot learning are designed with the goal of providing good predictive power on the new problem (the second data source) even with a small number of examples from that problem. To see if this is in fact the case for the hierarchical prior considered here, we can investigate how much greater the regret bound is for $T^{(2)} > 0$ than for $T^{(2)} = 0$ with $T^{(1)}$ fixed. With $\sigma_0^2 = \sigma^2$, (11) is greater in the former ($K = 2$) than the latter ($K = 1$) scenario by at most

$$[\ldots]$$

So if $T^{(2)}$ is small and $\theta^{*(1)}$ and $\theta^{*(2)}$ are close in distance, the regret bound for the second source is small.

The regret bound for the two-level prior in Salakhutdinov et al. (2011) is quite similar to that for the one-level prior. Let $s_k \in \{1, \dots, S\}$ denote the superclass of class $k$. In the case of image classification, object classes that have similar visual appearance would have a common superclass. The two-level prior consists of an overall parameter prior $\beta \sim \mathcal{N}(0, \sigma_0^2 I)$, superclass parameter priors $\mu^{(s)} \sim \mathcal{N}(\beta, \sigma_1^2 I)$, and class parameter priors $\theta^{(k)} \sim \mathcal{N}(\mu^{(s_k)}, \sigma_2^2 I)$. The regret bound for the two-level prior is

$$[\ldots] + n \sum_{k=1}^{K} O\big(\ln(c_1 + c_2 T^{(k)})\big) + O(1),$$

where $c_0, c_1, c_2 > 0$ are constants, $c_{k\ell} = c_{s_k}$ if $s_k = s_\ell$, and $c_{k\ell} = c_{s_k s_\ell}$ if $s_k \neq s_\ell$. Furthermore, $c_s > c_{s's''}$ for all $s, s', s'' \in \{1, \dots, S\}$. Hence, whether the regret bound is small depends more on the parameter vectors in the same superclass being close to each other than on parameter vectors from different superclasses being close to each other. See Appendix E.2 for details. The two-level regret bound explains the results of Salakhutdinov et al. (2011) well. The poor performance on image classes with very different visual appearance from the other classes is unsurprising, since the parameter vectors that predict these classes well are going to have large distance from the parameter vectors of other object classes.
4.3. Hierarchical Priors for Feature Selection

A shortcoming of the priors investigated so far is the poor
dependence on the feature space dimension $n$. For example,
the Gaussian prior regret bound is (approximately)

LsayesiZ) < inf Lg* {Z) + IP H 2 > 

when $Tc\sigma^2 \gg n$, so the regret may grow linearly in this
regime.⁴ In the infinite-dimensional case, Gaussian processes
can be used while still obtaining meaningful regret
bounds (Kakade et al., 2005; Seeger et al., 2008). However,
methods that are applicable to high-dimensional problems
for which $n \gg T$ but $n$ is still finite are of great general
interest. For example, in the image classification example
from the introduction, the feature vector has $n \approx 5000$,
whereas most object classes have fewer than 200 training
examples. In high-dimensional problems it is desirable to
use feature selection or sparse methods to reduce the effective
dimension of the problem, with the aim of achieving
better generalization performance and increasing the
interpretability of the model. A popular non-Bayesian approach
for inducing sparsity is $\ell_1$ regularization, such as the
lasso for linear regression (Tibshirani, 1996). A Bayesian
approach is the Bayesian lasso: the regularizer of the
lasso is converted into a prior, which amounts to placing a
Laplace prior on $\theta$ (Park & Casella, 2008). However, the
Bayesian lasso still seems to lead to a linear dependence
on the dimension because the model puts zero prior mass
on a component being exactly zero. A regret bound for
the Bayesian lasso can be found in Appendix F.1 (we
suspect that our bound is essentially tight, though we have
been unable to obtain a matching lower bound).

Another common Bayesian approach to inducing sparsity
is to use a hierarchical “spike and slab” prior, which
places positive probability on a component being exactly
zero (Ishwaran & Rao, 2005; Narisetty & He, 2014). One
version of the spike and slab prior is

$$z_i \mid p \sim \text{Bern}(p) \quad\text{and}\quad \theta_i \mid z_i \sim z_i \delta_0 + (1 - z_i)\,\mathcal{N}(0, \sigma^2).$$

So with probability $p$ component $i$ is zero and with probability
$1 - p$ it is Gaussian-distributed. Integrating out $z_i$
yields prior density $p_0(\theta_i) = p\,\delta_0(\theta_i) + (1 - p)\,\mathcal{N}(\theta_i \mid 0, \sigma^2)$.

Let $\|v\|_0$ denote the $\ell_0$ norm of the vector $v$.

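Theorem 4.3. For the spike-and-slab prior, if $m = \|\theta^*\|_0$,
then $\mathcal{R}(Z, \theta^*)$ is bounded by

$$\frac{\|\theta^*\|^2}{2\sigma^2} + m \ln\frac{1}{1 - p} + (n - m)\ln\frac{1}{p} + \frac{m}{2}\ln\left(1 + \frac{Tc\sigma^2}{m}\right). \qquad (14)$$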

⁴ Dimension-independent regret bounds for the priors already
considered can be obtained, but at the price of a constant greater
than one in front of the $L_{\theta^*}(Z)$ term. See, e.g., Banerjee (2007).


In particular, if $p = q^{1/n}$ for some constant $0 < q < 1$,
then $\mathcal{R}(Z, \theta^*)$ is at most

$$\frac{\|\theta^*\|^2}{2\sigma^2} + m \ln\frac{n}{1 - q} + \ln\frac{1}{q} + \frac{m}{2}\ln\left(1 + \frac{Tc\sigma^2}{m}\right). \qquad (15)$$


The theorem shows the importance of properly scaling $p$
with the dimension of the problem. If $p$ is kept fixed, then
the regret has linear dependence on $n$. However, by scaling
$p$ to be $q^{1/n}$ we increase the probability of a component
being zero as the dimension increases and thus are able to
ensure that the regret is only logarithmic in $n$ while simultaneously
maintaining the appropriate linear dependence on
$m$. The constant $q$ turns out to be the prior probability
that all of the components are set to zero. More generally,
the limiting (as $n \to \infty$) prior probability of choosing exactly
$k$ components to be non-zero is $q\,(\ln q^{-1})^k / k!$.
Hence, as $n \to \infty$, the prior over the number of
non-zero components converges to a Poisson distribution
with rate parameter $\ln q^{-1}$. So when $n$ is large the expected
number of non-zero components is $\approx \ln q^{-1}$. The choice of
$p$ close to 1 for large $n$ is in notable contrast to the common
practice of setting $p = 1/2$ or some other constant independent
of $n$ (Schneider & Corcoran, 2004; Ishwaran & Rao,
2005). Our results strongly recommend against this practice.
See Narisetty & He (2014) for a discussion of purely
statistical reasons to scale $p$ with the dimension.
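
The effect of scaling $p$ can be illustrated numerically. The following is a minimal sketch, not the authors' code: it evaluates a bound of the form $\|\theta^*\|^2/(2\sigma^2) + m\ln\frac{1}{1-p} + (n-m)\ln\frac{1}{p} + \frac{m}{2}\ln(1 + Tc\sigma^2/m)$, whose two logarithmic penalty terms are the ones bounded in Appendix F.3. The function names `spike_slab_bound` and `scaled_p`, and the values chosen for `sigma2`, `T`, `c`, `q` and the non-zero entries, are illustrative assumptions.

```python
import math


def spike_slab_bound(nonzero, n, p, sigma2, T, c):
    """Evaluate the spike-and-slab regret bound for a sparse theta*.

    nonzero: the non-zero entries of theta* (the other n - m entries are zero).
    p:       prior probability that a single component is exactly zero.
    """
    m = len(nonzero)
    sq_norm = sum(v * v for v in nonzero)
    bound = sq_norm / (2.0 * sigma2)
    bound += -m * math.log1p(-p)       # m * ln(1 / (1 - p))
    bound += -(n - m) * math.log(p)    # (n - m) * ln(1 / p)
    if m > 0:
        bound += 0.5 * m * math.log1p(T * c * sigma2 / m)
    return bound


def scaled_p(q, n):
    """Dimension-dependent choice p = q^(1/n), so that p^n = q."""
    return q ** (1.0 / n)


# Hypothetical problem sizes, chosen only for illustration.
nonzero = [1.0, -0.5, 2.0]
sigma2, T, c, q = 1.0, 200, 0.25, 0.5

for n in (10**2, 10**4, 10**6):
    fixed = spike_slab_bound(nonzero, n, 0.5, sigma2, T, c)
    scaled = spike_slab_bound(nonzero, n, scaled_p(q, n), sigma2, T, c)
    print(f"n={n:>8d}  p=1/2: {fixed:12.1f}   p=q^(1/n): {scaled:8.2f}")
```

With $p$ fixed the $(n-m)\ln(1/p)$ term dominates and grows linearly in $n$, whereas with $p = q^{1/n}$ the same term never exceeds $\ln(1/q)$ and only the $m\ln\frac{1}{1-p}$ term grows, logarithmically in $n$.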


5. Conclusion 

In this paper we set out to understand and quantify the
learning-theoretic benefits of Bayesian hierarchical modeling.
In Section 4, we used our first main result, Theorem 2.2,
to analyze three specific hierarchical priors that,
particularly when combined with a logistic or Gaussian regression
likelihood, are widely used in practice. Indeed,
these prior-likelihood combinations have often been used
with substantial success even in situations when they are
known to be rather poor models for the data generating
mechanism. Our analysis offers an explanation for this success.
The priors we analyzed are representative of the variety
of ways in which hierarchical models are employed:
representing uncertainty in hyperparameters, tying together
related groups of observations, and creating more complicated
distributions from simpler ones. Thus, our results
answer Questions Q1-Q4 in some important cases and exemplify
a learning-theoretic analysis technique that can be
applied to other hierarchical models. In addition, using our
second main result, Theorem 3.2, all of the insights gained
in Section 4 for the log-loss regret setting apply equally
well to the batch setting of statistical risk with bounded
loss, further extending the applicability of our conclusions.
A. Regret Bounds for Non-GLM Likelihoods

Recall Proposition 2.1, restated here for convenience: 

Proposition. The Bayesian cumulative loss is bounded as 

$$L_{\text{Bayes}}(Z_T) \le L_Q(Z_T) + \mathrm{KL}(Q \| P_0). \qquad (A.1)$$


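Proof of Theorem 2.4. Fix a choice of $\theta^*$ and $\phi$ and write $Q = Q_{\theta^*,\phi}$. Take a second-order Taylor expansion of $f_y$ about
$z^*$, yielding

$$f_y(z) = f_y(z^*) + f_y'(z^*)^\top (z - z^*) + \tfrac{1}{2}(z - z^*)^\top f_y''(\zeta(z))(z - z^*),$$

for some function $\zeta$. Let $z = (\theta x, \psi)$ with $\theta \sim Q$ and let $z^* = \mathbb{E}[z] = (\theta^* x, \psi^*)$. Hence,

$$\mathbb{E}_z[f_y(z)] = f_y(z^*) + f_y'(z^*)^\top 0 + \tfrac{1}{2}\mathbb{E}_z\left[(z - z^*)^\top f_y''(\zeta(z))(z - z^*)\right]
\le f_y(z^*) + \tfrac{c}{2}\,\mathbb{E}_z\left[(z - z^*)^\top (z - z^*)\right].$$

Defining

$$\omega \triangleq (\underbrace{x, \dots, x}_{n' \text{ times}}, \underbrace{1, \dots, 1}_{n'' \text{ times}}),$$

we next observe that

$$(z - z^*)^\top (z - z^*) = \omega^\top (\theta - \theta^*)(\theta - \theta^*)^\top \omega. \qquad (A.2)$$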


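Letting $\Sigma = \mathrm{Var}[\theta]$, we thus have

$$\begin{aligned}
\mathbb{E}_z\left[(z - z^*)^\top (z - z^*)\right] &= \omega^\top \mathbb{E}_\theta\left[(\theta - \theta^*)(\theta - \theta^*)^\top\right] \omega \\
&\le \|\omega\|^2 \left\|\mathbb{E}_\theta\left[(\theta - \theta^*)(\theta - \theta^*)^\top\right]\right\| \\
&= (n' \|x\|^2 + n'') \|\Sigma\| \\
&\le (n' + n'') \|\Sigma\|,
\end{aligned}$$

since it is assumed that $\|x\|_2 \le 1$. Noting that $L_Q(Z_T) = \sum_t \mathbb{E}_Q[f_{y_t}(\theta x_t, \psi)]$ and $L_{\theta^*}(Z_T) = \sum_t f_{y_t}(\theta^* x_t, \psi^*)$, we
have

$$L_Q(Z_T) \le L_{\theta^*}(Z_T) + \frac{Tc(n' + n'')\|\Sigma\|}{2}. \qquad (A.3)$$

Combining (A.1) and (A.3) yields the theorem. □

Proof of Theorem 2.2. Follows as a special case of Theorem 2.4 by choosing $n' = 1$ and $n'' = 0$. □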


A.1. Application to Multi-class Logistic Regression

For multi-class logistic regression (MLR), $y \in \{1, \dots, K\}$ is one of $K$ classes, the parameters are $\theta = (\theta^{(1)}, \dots, \theta^{(K)})$, and the
likelihood is

$$p(y \mid \theta, x) = \frac{\exp(\theta^{(y)} \cdot x)}{\sum_{k=1}^{K} \exp(\theta^{(k)} \cdot x)}. \qquad (A.4)$$

In order to apply Theorem 2.4, we require the following result:

Proposition A.1. Assumption (AF) holds for the MLR likelihood with $c = 1/2$.
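Proof. First note that

$$f_y(z) = -z_y + \ln \sum_{k=1}^{K} e^{z_k},$$

where $z_i = \theta^{(i)} \cdot x$, and hence the Hessian of $f_y(z)$ is independent of $y$:

$$f_y''(z) = \frac{1}{\left(\sum_{k=1}^{K} e^{z_k}\right)^2}
\begin{pmatrix}
e^{z_1} \sum_{k \ne 1} e^{z_k} & -e^{z_1 + z_2} & \cdots & -e^{z_1 + z_K} \\
-e^{z_2 + z_1} & e^{z_2} \sum_{k \ne 2} e^{z_k} & \cdots & -e^{z_2 + z_K} \\
\vdots & \vdots & \ddots & \vdots \\
-e^{z_K + z_1} & -e^{z_K + z_2} & \cdots & e^{z_K} \sum_{k \ne K} e^{z_k}
\end{pmatrix}.$$

Applying Gershgorin’s circle theorem, we find that

$$\|f_y''(z)\| \le \frac{2 e^{z_1} \sum_{k \ne 1} e^{z_k}}{\left(\sum_{k=1}^{K} e^{z_k}\right)^2},$$

where without loss of generality we have applied the theorem to the first row of the Hessian. Defining $a \triangleq e^{z_1} > 0$ and
$b \triangleq \sum_{k \ne 1} e^{z_k} > 0$, we have $\|f_y''(z)\| \le \frac{2ab}{(a + b)^2}$. Maximization over the positive orthant occurs at $a = b > 0$, so

$$\|f_y''(z)\| \le 1/2. \qquad □$$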
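
The constant $c = 1/2$ can be checked numerically. The sketch below is an illustration rather than part of the proof: it uses the standard identity that the Hessian of $-z_y + \ln\sum_k e^{z_k}$ equals $\mathrm{diag}(\pi) - \pi\pi^\top$, where $\pi$ is the softmax of $z$, and evaluates its spectral norm at randomly drawn $z$. The function names, the random seed and the sampling scale are assumptions made for this example.

```python
import numpy as np


def softmax_hessian(z):
    """Hessian of f_y(z) = -z_y + ln(sum_k exp(z_k)); it does not depend on y."""
    z = np.asarray(z, dtype=float)
    probs = np.exp(z - z.max())  # subtract the max for numerical stability
    probs /= probs.sum()
    return np.diag(probs) - np.outer(probs, probs)


def spectral_norm(matrix):
    """Largest singular value, i.e. the operator 2-norm."""
    return np.linalg.norm(matrix, 2)


rng = np.random.default_rng(0)
worst = 0.0
for K in (2, 3, 5, 10):
    for _ in range(2000):
        z = rng.normal(scale=5.0, size=K)
        worst = max(worst, spectral_norm(softmax_hessian(z)))

# Largest norm found by random search, and the value at the symmetric K = 2 point.
print(worst, spectral_norm(softmax_hessian([0.0, 0.0])))
```

The bound is attained when two classes share all of the probability mass equally, which is the $a = b$ case of the maximization above.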


Reasoning similarly to Theorem E.1, one can easily prove:

Theorem A.2 (Hierarchical Gaussian regret, multi-class regression). If $\theta_j^{(1:K)} \sim \mathcal{N}(0, \Sigma)$, $j = 1, \dots, n$, then using the

MLR likelihood guarantees that $\mathcal{R}(Z, \theta^*)$ is bounded by


R 


>mlr—HG 

Bayes 




=k(fe)||2 , 


(j2j2 2-^ 


” 1 
2^" 


1 + :^ 


nK 


k<t 


— In ( 1 - ^ + 


Q*(k) _ Q*(i) |j2 
-2 


2 n 


where $\gamma^2 = K\sigma_0^2 + \sigma^2$.

Theorem 2.5 follows as a special case of Theorem A.2 by taking $\sigma_0 = 0$.


B. Proof of Theorem 3.2 

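Since $p_T(\theta) = \frac{p(Y \mid X, \theta)\, p_0(\theta)}{p(Y \mid X)}$,

$$\begin{aligned}
\mathrm{KL}(P_T \| P_0) &= \mathbb{E}_{P_T}\left[\ln \frac{p_T(\theta)}{p_0(\theta)}\right] \\
&= \mathbb{E}_{P_T}\left[\ln \frac{p(Y \mid X, \theta)}{p(Y \mid X)}\right] \\
&= L_{\text{Bayes}}(Z_T) - L_{P_T}(Z_T). \qquad (B.1)
\end{aligned}$$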


Combining (2) and (B.l) with Theorem 3.1 implies that with probability 1 — S, for all 9, 

iLeiZr) - Lp^{Zt) + B{e) + C{T) + \uk'/5 ' 


|£(Pt) — £{Pt, Zt)\ < \/k 


T 




Observing that $L_{\theta^*}(Z_T) \le L_{P_T}(Z_T)$, so $L_{\theta^*}(Z_T) - L_{P_T}(Z_T) \le 0$, completes the proof.
C. KL Divergence Derivations

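C.1. Multivariate Gaussians

Let $D_i = \mathcal{N}(\mu_i, \Sigma_i)$, $i = 1, 2$, where $\dim(\mu_i) = n$. Then

$$\begin{aligned}
\mathrm{KL}(D_1 \| D_2) &= \frac{1}{2}\mathbb{E}_{D_1}\left[\ln \frac{|\Sigma_2|}{|\Sigma_1|} - (x - \mu_1)^\top \Sigma_1^{-1}(x - \mu_1) + (x - \mu_2)^\top \Sigma_2^{-1}(x - \mu_2)\right] \\
&= \frac{1}{2}\left\{\ln \frac{|\Sigma_2|}{|\Sigma_1|} + \mathbb{E}_{D_1}\left[-\mathrm{Tr}\left(\Sigma_1^{-1}(x - \mu_1)(x - \mu_1)^\top\right) + \mathrm{Tr}\left(\Sigma_2^{-1}(x - \mu_2)(x - \mu_2)^\top\right)\right]\right\} \\
&= \frac{1}{2}\left\{\ln \frac{|\Sigma_2|}{|\Sigma_1|} - \mathrm{Tr}(\Sigma_1^{-1}\Sigma_1) + \mathbb{E}_{D_1}\left[\mathrm{Tr}\left(\Sigma_2^{-1}(x x^\top - 2 x \mu_2^\top + \mu_2 \mu_2^\top)\right)\right]\right\} \\
&= \frac{1}{2}\left\{\ln \frac{|\Sigma_2|}{|\Sigma_1|} - n + \mathbb{E}_{D_1}\left[\mathrm{Tr}\left(\Sigma_2^{-1}(x x^\top - 2 x \mu_2^\top + \mu_2 \mu_2^\top)\right)\right]\right\} \\
&= \frac{1}{2}\left\{\ln \frac{|\Sigma_2|}{|\Sigma_1|} - n + \mathrm{Tr}\left(\Sigma_2^{-1}(\Sigma_1 + \mu_1 \mu_1^\top - 2 \mu_1 \mu_2^\top + \mu_2 \mu_2^\top)\right)\right\} \\
&= \frac{1}{2}\left\{\ln \frac{|\Sigma_2|}{|\Sigma_1|} - n + \mathrm{Tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_1 - \mu_2)^\top \Sigma_2^{-1}(\mu_1 - \mu_2)\right\}.
\end{aligned}$$
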
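
The closed form for the KL divergence between two multivariate Gaussians can be evaluated directly. The sketch below is an illustrative implementation under the assumption that both covariance matrices are symmetric positive definite; the function name `gaussian_kl` and the example values are not from the paper. It also evaluates the isotropic case $Q = \mathcal{N}(\theta^*, \phi^2 I)$ against $P_0 = \mathcal{N}(0, \sigma^2 I)$, the form of posterior approximation used in the regret proofs.

```python
import numpy as np


def gaussian_kl(mu1, cov1, mu2, cov2):
    """KL(N(mu1, cov1) || N(mu2, cov2)) for positive-definite covariances."""
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    cov1 = np.asarray(cov1, dtype=float)
    cov2 = np.asarray(cov2, dtype=float)
    n = mu1.shape[0]
    diff = mu1 - mu2
    # slogdet avoids overflow/underflow of the raw determinant.
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    trace_term = np.trace(np.linalg.solve(cov2, cov1))  # Tr(cov2^{-1} cov1)
    quad_term = diff @ np.linalg.solve(cov2, diff)       # Mahalanobis term
    return 0.5 * (logdet2 - logdet1 - n + trace_term + quad_term)


# Isotropic example with hypothetical values of theta*, phi^2 and sigma^2.
theta_star = np.array([1.0, -2.0, 0.5])
phi2, sigma2 = 0.3, 2.0
n = theta_star.shape[0]
kl = gaussian_kl(theta_star, phi2 * np.eye(n), np.zeros(n), sigma2 * np.eye(n))
print(kl)
```

In the isotropic case the expression reduces to $\frac{n}{2}\ln\frac{\sigma^2}{\phi^2} - \frac{n}{2} + \frac{n\phi^2}{2\sigma^2} + \frac{\|\theta^*\|^2}{2\sigma^2}$, which is a convenient sanity check for the implementation.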


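C.2. Gaussian and t-Distribution

Let $D_1 = \mathcal{N}(\mu_1, \Sigma_1)$ and $D_2 = T_\nu(\mu_2, \Sigma_2)$, where $\dim(\mu_1) = k$. Then

$$\begin{aligned}
\mathrm{KL}(D_1 \| D_2) &= \ln \frac{\Gamma\left(\frac{\nu}{2}\right)}{\Gamma\left(\frac{\nu + k}{2}\right)} + \frac{k}{2}\ln(\nu\pi) + \frac{1}{2}\ln|\Sigma_2| - \frac{1}{2}\ln|\Sigma_1| - \frac{k}{2}\ln(2\pi e) \\
&\qquad + \frac{\nu + k}{2}\,\mathbb{E}_{D_1}\left[\ln\left(1 + \frac{1}{\nu}(x - \mu_2)^\top \Sigma_2^{-1}(x - \mu_2)\right)\right] \\
&= \ln \frac{\nu^{k/2}\,\Gamma\left(\frac{\nu}{2}\right)}{\Gamma\left(\frac{\nu + k}{2}\right)} + \frac{1}{2}\ln\frac{|\Sigma_2|}{|\Sigma_1|} - \frac{k}{2}\ln 2e \\
&\qquad + \frac{\nu + k}{2}\,\mathbb{E}_{D_1}\left[\ln\left(1 + \frac{1}{\nu}(x - \mu_2)^\top \Sigma_2^{-1}(x - \mu_2)\right)\right].
\end{aligned}$$
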


For the first term, if k is even, then 


r (|) i /'=/2 ^ j ^ fc /2 

r(i^) “ 


where y— = y{y — 1 )... (y — n + 1 ) is the descending factorial. Now assume k is odd. By Gautschi’s inequality, 
r(riv2) ^ Choosing a = vl2 yields 


r(^) ~ ~ (|)i/2(£^) (fc-i)/2 ' 


Now, bounding the expectation gives 
/ (mi - - ^ 2 )

^ +^Tr(S2-iSi), 

where the second inequality follows from the fact that $\ln(a + b) \le \ln(a) + b/a$. Combining everything yields
KL(Di||D 2 ) < In + 1 in ^ ^ ln2e + Tr(E^iSi) 

+ In ^1 + ^(/ii - - ^ 2 )^ , 

where 


< In ( 1 H—(/ii — /i 2 )^S 2 ^(/ii — /J. 2 ) 

' V 


< In ( 1 + -(^1 - fJi 2 y ^2 ^(a^i - M 2 ) 


— 


^ u-\-k '^k/2 

(|)l/2(jiM) (fc-l)/2 


if k is even 
if k is odd. 


C.3. Gaussian and Laplace 

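Let $D_1 = \mathcal{N}(\mu, \sigma^2)$ and $D_2 = \text{Lap}(\beta)$. Then

$$\begin{aligned}
\mathrm{KL}(D_1 \| D_2) &= \ln(2\beta) + \frac{1}{\beta}\mathbb{E}_{D_1}[|x|] - \frac{1}{2}\ln(2\pi e \sigma^2) \\
&= \ln(2\beta) + \frac{1}{\beta}\left\{\mu\,\mathrm{Erf}\left(\frac{\mu}{\sqrt{2}\sigma}\right) + \frac{\sqrt{2}\sigma}{\sqrt{\pi}}\exp\left(-\frac{\mu^2}{2\sigma^2}\right)\right\} - \frac{1}{2}\ln(2\pi e \sigma^2)
\end{aligned}$$
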


<iln^ 


1 


lMK/l-exp<i-^ 


2v^o 


exp 


1 

2a2 J 


- 2 
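
The exact Gaussian-to-Laplace KL divergence is simple to compute, since the only non-trivial ingredient is the mean of a folded Gaussian, $\mathbb{E}|x| = \mu\,\mathrm{Erf}(\mu/(\sqrt{2}\sigma)) + \sigma\sqrt{2/\pi}\,\exp(-\mu^2/(2\sigma^2))$. The sketch below assumes that $\text{Lap}(\beta)$ denotes the zero-centered Laplace distribution with scale $\beta$, i.e. density $\frac{1}{2\beta}e^{-|x|/\beta}$; this parametrization, the function names and the example values are assumptions made for illustration.

```python
import math


def expected_abs_gaussian(mu, sigma):
    """E|x| for x ~ N(mu, sigma^2), the mean of the folded Gaussian."""
    return (mu * math.erf(mu / (math.sqrt(2.0) * sigma))
            + sigma * math.sqrt(2.0 / math.pi) * math.exp(-mu * mu / (2.0 * sigma * sigma)))


def kl_gaussian_laplace(mu, sigma, beta):
    """KL(N(mu, sigma^2) || Lap(beta)) with Lap(beta) density exp(-|x|/beta) / (2 beta)."""
    negative_entropy = -0.5 * math.log(2.0 * math.pi * math.e * sigma * sigma)
    cross_entropy = math.log(2.0 * beta) + expected_abs_gaussian(mu, sigma) / beta
    return cross_entropy + negative_entropy


# Hypothetical values: the divergence grows roughly linearly in |mu| once |mu| >> sigma.
for mu in (0.0, 1.0, 5.0, 25.0):
    print(mu, kl_gaussian_laplace(mu, sigma=1.0, beta=1.0))
```

The roughly linear growth in $|\mu|$, as opposed to the quadratic growth for a Gaussian prior, reflects the heavier tails of the Laplace prior.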


D. Proof of Theorem 4.1 


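Choose $Q_{\theta^*,\phi} = \mathcal{N}(\theta^*, \phi^2 I)$. With $P_0 = T_\nu(0, \sigma^2 I)$, we have (Appendix C.2)

$$\mathrm{KL}(Q_{\theta^*,\phi} \| P_0) \le \ln \Lambda_{\nu,n} + \frac{n}{2}\ln\frac{\sigma^2}{\phi^2} - \frac{n}{2}\ln 2e + \frac{n(\nu + n)\phi^2}{2\nu\sigma^2} + \frac{\nu + n}{2}\ln\left(1 + \frac{1}{\nu\sigma^2}\|\theta^*\|^2\right),$$
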


where 


A 


L'.n — 




if n is even 
if n is odd. 


Note that if n is even then < 1 and if n is odd then Since Varg^, ^ [9i\ = cj)^, we have 


TccjP n v + 1 n n n{v + n) v + '~ 


LSaves (^) ^ inf Lq* (^) -f--f — In--t- — In —^ — — -t- 

SayesK J _ ti \ J ^ ^ ^ ^ u 2 (j)^ 2 2v 


In 1 + 


Choosing (fp' = 


T CIV (7 2 + ( 1 /+ n) n 


yields the theorem. 


E. More on Hierarchical Priors for Sharing Statistical Strength 

E.l. Multiple Simultaneous Observations 

The Bayesian learner receives K input-output pairs time step. Each output is predicted using 

a separate weight vector so the fc-th likelihood is p[y \ 0*-^^ • x), k = 1,... ,K. Write 
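Instead of using independent Gaussian priors on $\theta^{(1)}, \dots, \theta^{(K)}$, place a prior over the means of the $K$ priors. For each
dimension $j = 1, \dots, n$, let

$$\mu_j \mid \sigma_0 \sim \mathcal{N}(0, \sigma_0^2) \qquad (E.1)$$

and

$$\theta_j^{(k)} \mid \mu_j, \sigma \sim \mathcal{N}(\mu_j, \sigma^2), \quad k = 1, \dots, K, \qquad (E.2)$$

and write $\theta_j^{(1:K)} = (\theta_j^{(1)}, \dots, \theta_j^{(K)})$. Integrating out $\mu_j$ yields

$$\theta_j^{(1:K)} \sim \mathcal{N}(0, \Sigma), \qquad (E.3)$$

where, with $\mathbf{1}_K$ denoting the $K \times K$ all-ones matrix,

$$\Sigma \triangleq s^2 \rho\, \mathbf{1}_K + s^2 (1 - \rho) I, \qquad s^2 \triangleq \sigma_0^2 + \sigma^2, \qquad \rho \triangleq \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2}. \qquad (E.4)$$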
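
The covariance structure obtained by integrating out the shared mean can be verified by simulation. The sketch below is illustrative only: it draws a shared mean $\mu \sim \mathcal{N}(0, \sigma_0^2)$ and then $K$ coordinates $\theta^{(k)} \sim \mathcal{N}(\mu, \sigma^2)$, and compares the empirical covariance with $\Sigma = s^2\rho\,\mathbf{1}_K + s^2(1-\rho)I$. It also checks the determinant and inverse of $\Sigma$ in terms of $\gamma^2 = K\sigma_0^2 + \sigma^2$, the quantity that appears in Theorem E.1. The function name, seed and parameter values are assumptions made for this example.

```python
import numpy as np


def hierarchical_cov(K, sigma0_sq, sigma_sq):
    """Marginal covariance of (theta^(1), ..., theta^(K)) after integrating out mu."""
    s_sq = sigma0_sq + sigma_sq
    rho = sigma0_sq / s_sq
    return s_sq * rho * np.ones((K, K)) + s_sq * (1.0 - rho) * np.eye(K)


K, sigma0_sq, sigma_sq = 4, 2.0, 0.5
Sigma = hierarchical_cov(K, sigma0_sq, sigma_sq)

# Simulate the two-stage generative process for a single dimension j.
rng = np.random.default_rng(0)
num_draws = 200_000
mu = rng.normal(0.0, np.sqrt(sigma0_sq), size=num_draws)
theta = mu[:, None] + rng.normal(0.0, np.sqrt(sigma_sq), size=(num_draws, K))
empirical = np.cov(theta, rowvar=False)

# Closed forms: Sigma = sigma^2 I + sigma_0^2 * ones, so Sherman-Morrison applies.
gamma_sq = K * sigma0_sq + sigma_sq
det_closed_form = sigma_sq ** (K - 1) * gamma_sq
inv_closed_form = np.eye(K) / sigma_sq - sigma0_sq / (sigma_sq * gamma_sq) * np.ones((K, K))

print(np.abs(empirical - Sigma).max())
print(np.linalg.det(Sigma), det_closed_form)
```

Every off-diagonal entry of $\Sigma$ equals $\sigma_0^2$, so the correlation $\rho$ between tasks is governed entirely by how much of the total prior variance comes from the shared mean.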


The Bayesian learner uses this hierarchical prior to simultaneously predict $y_t^{(1)}, \dots, y_t^{(K)}$. For the following theorem, we
must replace (A2) with an appropriately modified assumption for the simultaneous prediction task:

$$\|x_t^{(k)}\|_2 \le 1 \text{ for all } t, k. \qquad (A2')$$

Theorem E.1 (Hierarchical Gaussian regret, simultaneous observations). If $\theta_j^{(1:K)} \sim \mathcal{N}(0, \Sigma)$, $j = 1, \dots, n$, and (A2')
holds in lieu of (A2), then $\mathcal{R}(Z, \theta^*)$ is bounded by


R 


<HG—sim ( 


Bayes (Z, 0*) ^ ^ ^ f 


A"ln,l 


Ka^o 


nK 


lnll-4 + — 

7 ^ n 


(E.5) 


where $\gamma^2 = K\sigma_0^2 + \sigma^2$.

It is instructive to compare the upper bound given in (E.5) to P^BayesiZ(k): with prior variance = 

To do so, we find A{9*) 4 - R^g,,{Z, 9*): 


nK 


In 


n^{l - 7) + Ecs 


n + Tcs'^ 


^ 1 

--i„ 






a2K 


For example, setting $\sigma_0 = \sigma$, so the correlation $\rho$ is $1/2$, and $K = 2$, we find that if

4|j6/Ai) _ 61*(2)||2 ss^nln ( 3» + ^7 ^ < ||0*(i)|j2 ||6/*(2)||2 Q.SfiSs^n, 

\ n + Tcs‘‘ J 

then the hierarchical model has a smaller regret bound than the non-hierarchical model.^ As long as Tcs^ > 2n, the 
condition becomes |p + Cs^n for some 0 <C < 0.863. In this case there are two 

important observations about the benefits of the hierarchical model. First, noting that the expected magnitude of |p 
and 110**-^^IP is cr^n, as long as and are only a constant fraction C/A of their expected magnitudes, 

the hierarchical model will always have smaller regret bound. Second, even if the previous condition does not hold, the 
difference in must be significantly larger than the expected magnitudes of || 0 **'^pp and || 0 **'^pp for the 

hierarchical model to have a larger regret bound than the non-hierarchical model. Thus, the use of the hierarchical model 
has potentially significantly reduced regret compared to the non-hierarchical model. 

^ For clarity, we have replaced 3 ln(4/3) with the bound 0.863. 
E.2. Two-level Prior

In this section we derive bounds for the two-level prior in the case of sequential observations. Recall that the prior is 

Integrating out $\beta$, we immediately obtain:


s = 


k = l,...,K. 






(E. 6 ) 

(E.7) 

(E. 8 ) 

(E.9) 


where Ep, = agls + o-fl. Writing = fii ' and Oi = ' , we have 




E = 


E/i E^e 

E« 


E^ 


fi 6 


(E.IO) 


(E.ll) 


Hence, 

I /Xj ~ 3sr(E^gE^ Eg — E^gE^ ^E^e). 

Define the matrix P such that Pks = l{s = s^}. We therefore have = Pfi^ and hence E^g = PT,f^, and 

furthermore Eg — E^gE“^E^g = cr|/ and hence Eg = cr|/ + PE^P^. 

Hence, the prior on 6 i is Pq = 3^1(0, Eg). Choose Qe*, 4 , = diag cf)), yielding 


KL(Qg*,<^||Po) = i jin - k - Tr(Eg-i) ^ 4>l + . 


Straightforward calculations show that the regret is bounded by 

n K 


J 2 i 0 :V^ 9 ^ 0 : + E 9 2Tr(Eg-i) 


k=l 


cT^k) 


+ ^ln|Eg|. 


(E.12) 


(E.13) 


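E.3. Proof of Theorem E.1

First take $n = 1$, which will later generalize to arbitrary $n$. Choose $Q_{\theta^{*(1:K)},\phi} = \mathcal{N}(\theta^{*(1:K)}, \phi^2 I)$ and note that

$$|\Sigma| = \sigma^{2(K-1)}(K\sigma_0^2 + \sigma^2) = \sigma^{2(K-1)}\gamma^2 \quad\text{and}\quad \Sigma^{-1} = \frac{1}{\sigma^2} I - \frac{\sigma_0^2}{\sigma^2\gamma^2}\mathbf{1}_K.$$
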


Thus (Appendix C.l) 


KL(Qg.(.^)_^||Po) = i |ln i - X + 02 Tr(s-i) + | 


2 ^ 2 2(7272 ' 


K 




k=l 






k<l 


Moving to the case of general n, since Varg^, 6*7] = K(j}^ for all j = 1,..., n, applying Theorem 2.2 gives 


K 


LB..UZ) < E (z“') + y + =0 In 


/c=l 


2 (/)2a2/^ 


nK{-f'^ - a'^) 2 , 1 




2(7272 


vEiio- 

' fc=i 


(/c)||2 ^0 


0-2.-y2 


^ ||^*(fe) _ Q. 


W \\2 


k<e 


Choosing 0 ^ = 


yields the theorem. 


n{P-a^)+Tct7'^P 
E.4. Proof of Theorem 4.2

The proof is similar to that for Theorem E.1. However, use separate variances for each source:


'Y'i, fc) 

The error term from the Taylor expansion used in Theorem 2.2 is — 2 ^’ 


K 






2 

K 


2 am 


k 'l^k 


nK 

~Y 


2cr2^2 ’ 27221^"" " ' a 272 

' k ' k=l ' k<e 




Choosing = 


k 72(72 —0-2) _|_j'(fc)ccr2'72 


yields the theorem. 


F. More on Feature Selection

F.1. The Bayesian Lasso

For the Bayesian model average learner we have:

Theorem F.1 (GLM Bayesian lasso regret). If $\theta_i \sim \text{Lap}(0, \beta)$, $i = 1, \dots, n$, then




2^" 


2T2c2/34 


(^\/2n? + TcnP'^n — \J2n? ^ 


(F.l) 


In the regime of $Tc\beta^2 \ll n$, (F.1) becomes (approximately)

for some constant $C$ independent of $\beta$ and $c$. Hence, even for sparse $\theta^*$, the regret bound is $O(n)$. The inequalities used to
prove the regret bound are all quite tight, so we conjecture that, up to constant factors, there is a matching lower bound, at
least in the Gaussian regression case.


F.2. Proof of Theorem F.1

Apply Theorem 2.2 with $Q_{\theta^*,\phi} = \mathcal{N}(\theta^*, \phi^2 I)$. Since $p_0(\theta) = \prod_i \text{Lap}(\theta_i \mid 0, \beta)$, we have (see Appendix C.3)

27^0 


KL(Qe.,0||Po) < ^ ln(7re) + ^E 


|0 *|Jl-exp<i-?^ 




n 2/32 n ^n(j) I ^ . 

< A In ^ - T + ^ + 2 ^ E -- 


TTcj)' 


2 mMo* 


Since Varg^. [9,] = 


T ctjy^ n 


LsayesiZ) < inf Lg*( 2 ') H----ln( 7 re) + 


\/2n(f> n 2/3 


v^/3 




TT(fP' 


Choosing cjp' = 


^yt2n^+Tcn4i^7z —\/2n^ 




gives the desired result. 



























Risk and Regret of Hierarchical Bayesian Learners 


F.3. Proof of Theorem 4.3 

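Fix some $\theta^*$. If $\theta_i^* = 0$, then let $Q_{\theta_i^*,\phi^2} = \delta_0$, so $\mathrm{KL}(Q_{\theta_i^*,\phi^2} \| P_0) = \ln\frac{1}{p}$. If $\theta_i^* \ne 0$, then let
$Q_{\theta_i^*,\phi^2} = \mathcal{N}(\theta_i^*, \phi^2)$, so

$$\mathrm{KL}(Q_{\theta_i^*,\phi^2} \| P_0) = \mathrm{KL}(Q_{\theta_i^*,\phi^2} \| \mathcal{N}(0, \sigma^2)) + \ln\frac{1}{1 - p}.$$

The rest of the proof of (14) then closely follows earlier ones. To obtain (15), we observe that if $p = q^{1/n}$ then

$$m \ln\frac{1}{1 - p} = m \ln\frac{1}{1 - q^{1/n}} \le m \ln\frac{n}{1 - q}$$

and

$$(n - m)\ln\frac{1}{p} = \frac{n - m}{n}\ln\frac{1}{q} \le \ln\frac{1}{q},$$

so (15) follows.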








